Introduction

Below, I explore a tidy dataset of white wines to find out which chemical properties most influence the sensory quality of the wine. I’ll also look at which chemical properties are closely related to one another. The quality variable is based on taste-testing scores between 0 and 10 (worst to best); all the other variables are based on physicochemical tests.

I want to start by looking at the structure of the dataset.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

I am going to remove column “X”, as this column is an identifier, and I don’t want these numbers included in any statistics. After that, there will be 11 input variables and 1 output variable (quality), each with 4898 observations.
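As a rough illustration of this step, here is a minimal Python sketch (not the code used for this report) that drops the identifier column from a few rows. The rows and the four columns shown are copied from the structure listing above; the variable names `raw`, `rows`, and `wine` are my own.

```python
import csv
import io

# First three rows and a few columns, copied from the str() output above.
raw = """X,fixed.acidity,alcohol,quality
1,7,8.8,6
2,6.3,9.5,6
3,8.1,10.1,6
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Drop the identifier so it cannot leak into summaries or correlations.
wine = [
    {name: float(value) for name, value in row.items() if name != "X"}
    for row in rows
]

print(list(wine[0].keys()))  # column names without the identifier
```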


Univariate Plots Section

For each variable, I will start by plotting the original data in a histogram and a boxplot to view the distribution and outliers. This will be followed by a histogram and boxplot of the current subset. I will then remove outliers from the variable in the subset, and plot again.

To remove outliers, I will use the 1.5 × IQR rule: any value more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile is treated as an outlier.
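The rule can be sketched in a few lines of Python. This is an illustrative version, not the code behind the results below. The helper names `iqr_bounds` and `remove_outliers` are hypothetical, and the quartile definition (linear interpolation, which matches the default of R's `quantile()`) is an assumption about how the quartiles were computed.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates the way R's default quantile() does;
    # whether the original analysis used that definition is an assumption.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]

# First ten residual.sugar values from the structure listing above.
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5]
lower, upper = iqr_bounds(residual_sugar)
kept = remove_outliers(residual_sugar)
print(lower, upper, kept)
```

On these ten values alone, the two readings of 20.7 fall above the upper fence. Ten rows are far too few to say anything about the full dataset; they only show the mechanics.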

Quality

original:

The original quality distribution is mostly normal, with a few outliers at the upper and lower ends of the range.

subset with outliers removed:

## [1] 4698   12

200 outliers were removed from the original data. This leaves 4698 observations with quality scores only between 4 and 7 in the new subset.


Fixed Acidity

original:

subset:

In the original data, the histogram shows that fixed.acidity is approximately normally distributed with a slight right skew. The boxplot shows that there are a few outliers on the upper and lower ends of the range.

The subset data isn’t much different, since the only rows that have been removed are the outliers in the quality variable. I will now remove the fixed.acidity outliers in the subset dataframe.

subset with outliers removed:

## [1] 4560   12

138 outliers removed and fixed.acidity is now normally distributed.


Volatile Acidity

original:

subset:

In the original data, volatile.acidity is roughly bell-shaped and symmetric around the peak, with a slight right skew in the tail. There are several outliers in the upper end of the range.

The subset data shows almost the same as the original data. I’ll now remove the volatile.acidity outliers in the subset dataframe.

## [1] 4398   12

162 outliers were removed. The distribution is symmetric around the peak and mostly normal, with a slight right skew.


Citric Acid

original:

subset:

In the original dataset, the citric acid distribution is slightly right-skewed, with a small spike on the upper shoulder of the curve. There are outliers in the upper and lower ends of the range, as shown in the boxplot.

The subset data is close to the original dataset pattern.

## [1] 4138   12

260 outliers removed and the histogram shows that the density peak is flattened out a bit. The small spike is still apparent in the upper range of the shoulder of the curve. The boxplot shows that there are some new outliers, but I will leave these in the data, as I expect they are from the small data spike.
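New outliers can also appear after a removal pass simply because the quartiles are recomputed on the trimmed data, which narrows the fences. The sketch below shows this effect on made-up numbers. The values are hypothetical, the helper `iqr_outliers` is my own, and the quartile method is the same assumption as elsewhere in this report's illustrations.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Split values into (kept, flagged) using Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    kept = [v for v in values if lower <= v <= upper]
    flagged = [v for v in values if v < lower or v > upper]
    return kept, flagged

# Hypothetical measurements, invented for illustration only.
values = [10, 11, 12, 12, 13, 13, 14, 14, 15, 16, 19, 40, 45]

first_kept, first_flagged = iqr_outliers(values)
_, second_flagged = iqr_outliers(first_kept)

print(first_flagged)   # flagged against the original quartiles
print(second_flagged)  # flagged after the quartiles are recomputed
```

Here the value 19 sits inside the fences on the first pass but outside them on the second, even though nothing about the value itself has changed.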


Residual Sugar

original:

subset:

In the original data, residual sugar distribution is very right-skewed with a high number of observations on the lower end of the range. The boxplot shows that there are outliers in the upper end.

In the subset data, some of the upper-end outliers have already been removed from previous outlier removal. The data is still heavy on the lower end of the range followed by a more uniform spread of the higher values.

## [1] 4130   12

8 outliers were removed and the data is right-skewed.
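The direction of a skew can also be checked numerically rather than by eye. The sketch below computes a simple moment-based skewness, where a positive value indicates a right skew. It uses the first ten residual.sugar values from the structure listing (original data, not the trimmed subset), and the function name `sample_skewness` is my own.

```python
import statistics

def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (positive means right-skewed)."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# First ten residual.sugar values from the structure listing above.
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5]
skew = sample_skewness(residual_sugar)
print(round(skew, 2))
```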


Chlorides

original:

subset:

Both the original dataset and the subset show that chlorides have a mostly normal distribution, but with outliers in the upper and lower ends of the range.

## [1] 3985   12

145 outliers were removed, and the data is now closer to a normal distribution.


Free Sulfur Dioxide

original:

subset:

Both the original dataset and the subset show that free sulfur dioxide has a mostly normal distribution with outliers in the upper end.

## [1] 3942   12

43 outliers removed and the distribution is normal.


Total Sulfur Dioxide

original:

subset:

In the original dataset, total sulfur dioxide has a wide distribution with outliers in the upper and lower ends of the range.

The subset data is closer to a normal distribution, but with some outliers in the upper end of the range.

## [1] 3939   12

3 outliers were removed, and total sulfur dioxide has a normal distribution.


Density

original:

subset:

In the original dataset, density distribution is mostly normal, slightly right-skewed and has outliers in the upper end of the range.

The subset data is closer to a normal distribution, slightly right-skewed, with no outliers shown in the boxplot. For consistency, I will still perform the outlier calculations on the subset.

## [1] 3939   12

As I suspected, no outliers were removed after the calculations.


pH

original:

subset:

In both the original dataset and the subset, pH has a mostly symmetrical, wide distribution with outliers on the upper and lower ends of the range.

## [1] 3885   12

54 outliers were removed and the distribution is normal.


Sulphates

original:

subset:

In both the original dataset and the subset, sulphates have a distribution that is slightly right-skewed, with outliers in the upper end of the range.

## [1] 3797   12

88 outliers were removed, and the data is bimodal and slightly right-skewed.


Alcohol

original:

subset:

In both the original dataset and the subset, alcohol has a right-skewed, wide, and somewhat uniform distribution with no outliers. However, as I did above, I will perform the outlier calculation for consistency.

## [1] 3797   12

As suspected, no outliers were removed after the calculations.
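To keep track of how much data the successive removals cost, the row counts reported above can be tallied. The short Python sketch below uses only the dimensions printed in this section; the variable names are my own.

```python
# Row counts reported above after each removal step, in order.
steps = [
    ("quality", 4698), ("fixed.acidity", 4560), ("volatile.acidity", 4398),
    ("citric.acid", 4138), ("residual.sugar", 4130), ("chlorides", 3985),
    ("free.sulfur.dioxide", 3942), ("total.sulfur.dioxide", 3939),
    ("density", 3939), ("pH", 3885), ("sulphates", 3797), ("alcohol", 3797),
]

start = 4898  # observations in the original data
removed = {}
previous = start
for name, remaining in steps:
    removed[name] = previous - remaining
    previous = remaining

total_removed = start - previous
print(removed)
print(total_removed, round(100 * total_removed / start, 1))
```

In total, 1101 rows were removed, which is about 22.5% of the original 4898 observations.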


Univariate Analysis

  1. What is the structure of your dataset?
  2. What is/are the main feature(s) of interest in your dataset?
  3. What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
  4. Did you create any new variables from existing variables in the dataset?
  5. Of the features you investigated, were there any unusual distributions?
  6. Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Bivariate Plots Section

I think that the fastest way to identify correlations between variables would be to create a pairs plot.

Based on the plot above, there are a few pairs of variables for which I’d like to create individual correlation plots to get a closer look.

## [1] 0.8474419

Density and residual sugar have the strongest correlation, with a score of 0.85. The plot confirms this with a tight, positive relationship.
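For readers who want to see what sits behind a correlation score like this one, here is a minimal Python sketch of the Pearson coefficient. That the scores in this report are Pearson coefficients is an assumption on my part, and the `pearson` helper is illustrative. The five rows come from the structure listing above, so the result will not match the 0.85 computed on the full subset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows from the structure listing above.
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5]
density = [1.001, 0.994, 0.995, 0.996, 0.996]

r = pearson(residual_sugar, density)
print(round(r, 2))
```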


## [1] 0.6210167

Free sulfur dioxide and total sulfur dioxide have the next strongest correlation with a score of 0.62. Even though free sulfur dioxide is a continuous variable, there does seem to be a slight pattern of discrete grouping in the way the variable is recorded.


## [1] 0.554056

Density and total sulfur dioxide have a correlation score of 0.55. The plot confirms a positive correlation between these two variables.


## [1] 0.5035444

Density and chlorides have a correlation score of 0.50. This is also confirmed with the positive correlation plot between these two variables.


## [1] -0.8195218

Density and alcohol have the strongest negative correlation, with a score of -0.82. The plot confirms this strong negative relationship.


## [1] 0.413995

This plot shows that most of the observations have a quality score of 6, and that alcohol and quality have an overall positive correlation of 0.41.


Bivariate Analysis

  1. How did the feature(s) of interest vary with other features in the dataset?
  • From the pairs correlation plot, I could see that as citric acid increases, the quality decreases. When free sulfur dioxide is low, there is a positive correlation with quality. The quality then starts dropping as free sulfur dioxide increases. The same is true of total sulfur dioxide. Of all the input variables, alcohol had the highest correlation with quality.
  2. Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
  • I found it interesting that density and alcohol have the most negative correlation, yet density and residual.sugar have the strongest positive correlation.
  3. What was the strongest relationship you found?
  • For the input variables, residual.sugar and density had the highest correlation, with a score of 0.85. The highest correlation with the output variable, quality, was with alcohol (0.41). The strongest negative correlation was between density and alcohol, with a score of -0.82.
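When comparing these scores, it helps to separate strength (the magnitude of the coefficient) from direction (its sign). The sketch below ranks the coefficients printed in the Bivariate Plots section by magnitude; the dictionary layout and variable names are my own.

```python
# Correlation coefficients reported in the Bivariate Plots section.
correlations = {
    ("density", "residual.sugar"): 0.8474419,
    ("free.sulfur.dioxide", "total.sulfur.dioxide"): 0.6210167,
    ("density", "total.sulfur.dioxide"): 0.554056,
    ("density", "chlorides"): 0.5035444,
    ("density", "alcohol"): -0.8195218,
    ("alcohol", "quality"): 0.413995,
}

# Strength is the magnitude of r; the sign only gives the direction.
by_strength = sorted(
    correlations, key=lambda pair: abs(correlations[pair]), reverse=True
)
most_negative = min(correlations, key=correlations.get)

for pair in by_strength:
    print(pair, correlations[pair])
```

Ranked this way, density and alcohol come second only to density and residual.sugar, even though their coefficient is the lowest in signed terms.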

Multivariate Plots Section

Of the plots above, I’d like to include quality and see the correlations.

Here, I can see that the quality score of 5 has the strongest correlation between residual.sugar and density.
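One way to put numbers on a comparison like this is to compute the correlation separately within each quality score. The sketch below does that on toy data. The rows are invented for illustration and are not wine measurements, the `pearson` helper is my own, and the plots in this section may well have been produced differently.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (quality, x, y) rows, invented for illustration only.
rows = [
    (5, 1.0, 2.0), (5, 2.0, 4.0), (5, 3.0, 6.0), (5, 4.0, 8.0),
    (6, 1.0, 1.0), (6, 2.0, 3.0), (6, 3.0, 2.0), (6, 4.0, 4.0),
]

# Group the (x, y) pairs by quality score.
groups = defaultdict(list)
for quality, x, y in rows:
    groups[quality].append((x, y))

# One correlation coefficient per quality score.
by_quality = {
    quality: pearson([x for x, _ in pairs], [y for _, y in pairs])
    for quality, pairs in groups.items()
}
strongest = max(by_quality, key=lambda q: abs(by_quality[q]))
print(by_quality, strongest)
```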


total.sulfur.dioxide and free.sulfur.dioxide have an almost uniform positive correlation at every quality score, with 5 being the strongest.


total.sulfur.dioxide and density also have a positive correlation at every quality score. The quality score of 4 stands out; however, the quality score of 5 seems to be the strongest.


Much like the plot above, residual.sugar and total.sulfur.dioxide have a positive correlation at every quality score. The quality score of 4 stands out; however, the quality score of 7 seems to have the strongest positive correlation.


This plot shows that as alcohol content decreases and density increases, the quality drops.


Multivariate Analysis

  1. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
  • Yes. I found that chlorides and density had a strong correlation with the higher scores of quality.
  2. Were there any interesting or surprising interactions between features?
  • Yes. The correlation between alcohol and quality is positive, yet none of the other input variables had a strong positive correlation with alcohol.

Final Plots and Summary

Plot One

Description One

I chose alcohol content as my first plot because I found it interesting how wide and somewhat uniform the distribution is compared to all the other variables.


Plot Two

Description Two

I chose this as my second plot to show how there is a strong positive correlation between density and residual.sugar, yet a strong negative correlation between density and alcohol.


Plot Three

Description Three

I chose this plot to reflect the strong positive correlation between residual.sugar and density with quality factored in the point colors and correlation lines.


Reflection

From this analysis, I found that better-tasting white wine is low in density and low in residual sugar. I was also surprised to discover that white wine with a high alcohol by volume had a better quality score seemingly by perception alone, as alcohol did not have a strong positive correlation with any other chemical variable. This makes me wonder if the taste testers knew and were influenced by information about each of the white wines they tested, affecting the quality score.

One of my struggles was sticking to the idea of a “quick and dirty” analysis. Through much of this project, I wanted to make beautiful-looking plots, so I wasted a lot of time researching plot aesthetics instead of analyzing the data. Once I realized what I was doing and that I needed to keep it “quick and dirty”, the analysis and plotting came more easily. However, after my revision, most of my plots are not “quick and dirty”.

I do think this dataset could be used in machine learning to build predictive models of which physicochemical properties influence the taste of white wine. However, it should be used alongside many other similar datasets from controlled taste-testing events. It should also be noted that the sensory quality score of a wine would most likely vary across regions of the world. What is considered a great-tasting wine in one region may be considered the complete opposite in another.

The dataset used in this analysis was downloaded from:

For more information regarding this dataset, please visit these sites [Cortez et al., 2009]: